OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Efficient supervised and semi-supervised approaches for affiliations disambiguation

Internal identifier: 000198 (Main/Exploration); previous: 000197; next: 000199

Authors: Pascal Cuxac [France]; Jean-Charles Lamirel [France]; Valerie Bonvallot [France]

RBID : Pascal:13-0331130

Abstract

The disambiguation of named entities is a challenge in many fields, such as scientometrics, social networks, record linkage, citation analysis, and the semantic web. Name ambiguities can arise from misspellings, typographical or OCR mistakes, abbreviations, and omissions. Searching for the names of persons or organizations is therefore difficult whenever a single name may appear in many different forms. This paper proposes two approaches to disambiguating the affiliations of authors of scientific papers in bibliographic databases. The first assumes that a training dataset is available and uses a Naive Bayes model. The second assumes that no learning resource is available and uses a semi-supervised approach that combines soft clustering and Bayesian learning. The results are encouraging, and the approach is already partially applied in a scientific survey department. However, our experiments also highlight a limitation: the approach cannot efficiently process highly unbalanced data. Alternative solutions are possible for future developments, particularly the use of a recent clustering algorithm relying on feature maximization.
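
To make the supervised idea in the abstract concrete, here is a minimal sketch of a multinomial Naive Bayes classifier that maps a raw affiliation string to a canonical organization label. It is not the authors' implementation: the feature choice (word tokens plus character trigrams), the Laplace smoothing, the class name `NaiveBayes`, and the tiny `TRAINING` set are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase word tokens plus character trigrams (an assumed feature set),
    so that spelling variants of the same name still share features."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    grams = [w[i:i + 3] for w in words for i in range(len(w) - 2)]
    return words + grams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in samples:
            self.class_counts[label] += 1
            for tok in tokens(text):
                self.feature_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        # Features never seen in training carry no information; skip them.
        feats = [t for t in tokens(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for tok in feats:
                score += math.log((counts[tok] + 1) / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled affiliation strings, for illustration only.
TRAINING = [
    ("INIST-CNRS, Vandoeuvre les Nancy", "INIST"),
    ("Institut de l'Information Scientifique et Technique, CNRS", "INIST"),
    ("INIST CNRS Vandoeuvre", "INIST"),
    ("LORIA-Synalp, Vandoeuvre les Nancy", "LORIA"),
    ("Laboratoire lorrain de recherche en informatique et ses applications", "LORIA"),
    ("LORIA, Universite de Lorraine, Nancy", "LORIA"),
]

model = NaiveBayes().fit(TRAINING)
print(model.predict("INIST - C.N.R.S., Vandoeuvre-les-Nancy"))
print(model.predict("Lab. lorrain de recherche en informatique (LORIA)"))
```

Character trigrams are used here because the abstract names misspellings, OCR mistakes, and abbreviations as sources of ambiguity; a word-only model would treat each variant as an unrelated token. With strongly unbalanced classes, the prior and the per-class token totals dominate the score, which is in line with the limitation the abstract reports.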

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Efficient supervised and semi-supervised approaches for affiliations disambiguation</title>
<author>
<name sortKey="Cuxac, Pascal" sort="Cuxac, Pascal" uniqKey="Cuxac P" first="Pascal" last="Cuxac">Pascal Cuxac</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>INIST-CNRS, Vandoeuvre les Nancy</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Lamirel, Jean Charles" sort="Lamirel, Jean Charles" uniqKey="Lamirel J" first="Jean-Charles" last="Lamirel">Jean-Charles Lamirel</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>LORIA-Synalp, Vandoeuvre les Nancy</s1>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Lorraine</region>
</placeName>
<orgName type="team" n="7">Synalp (Loria)</orgName>
<orgName type="lab">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="EPST">Centre national de la recherche scientifique</orgName>
</affiliation>
</author>
<author>
<name sortKey="Bonvallot, Valerie" sort="Bonvallot, Valerie" uniqKey="Bonvallot V" first="Valerie" last="Bonvallot">Valerie Bonvallot</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>INIST-CNRS, Vandoeuvre les Nancy</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">13-0331130</idno>
<date when="2013">2013</date>
<idno type="stanalyst">PASCAL 13-0331130 INIST</idno>
<idno type="RBID">Pascal:13-0331130</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000041</idno>
<idno type="stanalyst">FRANCIS 13-0331130 INIST</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000071</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000727</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000041</idno>
<idno type="wicri:doubleKey">0138-9130:2013:Cuxac P:efficient:supervised:and</idno>
<idno type="wicri:Area/Main/Merge">000201</idno>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-00960435</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-00960435</idno>
<idno type="wicri:Area/Hal/Corpus">000044</idno>
<idno type="wicri:Area/Hal/Curation">000044</idno>
<idno type="wicri:Area/Hal/Checkpoint">000043</idno>
<idno type="wicri:doubleKey">0138-9130:2013:Cuxac P:efficient:supervised:and</idno>
<idno type="wicri:Area/Main/Merge">000145</idno>
<idno type="wicri:Area/Main/Curation">000198</idno>
<idno type="wicri:Area/Main/Exploration">000198</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Efficient supervised and semi-supervised approaches for affiliations disambiguation</title>
<author>
<name sortKey="Cuxac, Pascal" sort="Cuxac, Pascal" uniqKey="Cuxac P" first="Pascal" last="Cuxac">Pascal Cuxac</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>INIST-CNRS, Vandoeuvre les Nancy</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Lamirel, Jean Charles" sort="Lamirel, Jean Charles" uniqKey="Lamirel J" first="Jean-Charles" last="Lamirel">Jean-Charles Lamirel</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>LORIA-Synalp, Vandoeuvre les Nancy</s1>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Lorraine</region>
</placeName>
<orgName type="team" n="7">Synalp (Loria)</orgName>
<orgName type="lab">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="EPST">Centre national de la recherche scientifique</orgName>
</affiliation>
</author>
<author>
<name sortKey="Bonvallot, Valerie" sort="Bonvallot, Valerie" uniqKey="Bonvallot V" first="Valerie" last="Bonvallot">Valerie Bonvallot</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>INIST-CNRS, Vandoeuvre les Nancy</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Scientometrics : (Print)</title>
<title level="j" type="abbreviated">Scientometrics : (Print)</title>
<idno type="ISSN">0138-9130</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Scientometrics : (Print)</title>
<title level="j" type="abbreviated">Scientometrics : (Print)</title>
<idno type="ISSN">0138-9130</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithm</term>
<term>Bibliographic database</term>
<term>Citation analysis</term>
<term>Classification</term>
<term>Cluster</term>
<term>Research field</term>
<term>Scientific research</term>
<term>Scientometrics</term>
<term>Semantic analysis</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Domaine recherche</term>
<term>Scientométrie</term>
<term>Analyse citation</term>
<term>Analyse sémantique</term>
<term>Base de données bibliographiques</term>
<term>Amas</term>
<term>Algorithme</term>
<term>Classification</term>
<term>Recherche scientifique</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Classification</term>
<term>Recherche scientifique</term>
</keywords>
<keywords scheme="mix" xml:lang="fr">
<term>Clustering</term>
<term>affiliations</term>
<term>classification automatique</term>
<term>désambiguisation</term>
<term>infométrie</term>
<term>texte</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The disambiguation of named entities is a challenge in many fields, such as scientometrics, social networks, record linkage, citation analysis, and the semantic web. Name ambiguities can arise from misspellings, typographical or OCR mistakes, abbreviations, and omissions. Searching for the names of persons or organizations is therefore difficult whenever a single name may appear in many different forms. This paper proposes two approaches to disambiguating the affiliations of authors of scientific papers in bibliographic databases. The first assumes that a training dataset is available and uses a Naive Bayes model. The second assumes that no learning resource is available and uses a semi-supervised approach that combines soft clustering and Bayesian learning. The results are encouraging, and the approach is already partially applied in a scientific survey department. However, our experiments also highlight a limitation: the approach cannot efficiently process highly unbalanced data. Alternative solutions are possible for future developments, particularly the use of a recent clustering algorithm relying on feature maximization.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Lorraine</li>
</region>
<settlement>
<li>Nancy</li>
<li>Vandœuvre-lès-Nancy</li>
</settlement>
<orgName>
<li>Centre national de la recherche scientifique</li>
<li>Laboratoire lorrain de recherche en informatique et ses applications</li>
<li>Synalp (Loria)</li>
<li>Université de Lorraine</li>
</orgName>
</list>
<tree>
<country name="France">
<noRegion>
<name sortKey="Cuxac, Pascal" sort="Cuxac, Pascal" uniqKey="Cuxac P" first="Pascal" last="Cuxac">Pascal Cuxac</name>
</noRegion>
<name sortKey="Bonvallot, Valerie" sort="Bonvallot, Valerie" uniqKey="Bonvallot V" first="Valerie" last="Bonvallot">Valerie Bonvallot</name>
<name sortKey="Lamirel, Jean Charles" sort="Lamirel, Jean Charles" uniqKey="Lamirel J" first="Jean-Charles" last="Lamirel">Jean-Charles Lamirel</name>
</country>
</tree>
</affiliations>
</record>
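
The record above can be read outside Dilib with a general-purpose XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` to pull the title and author names from a shortened copy of the record. One caveat: as displayed, the record uses the `wicri:` and `inist:` prefixes without declaring them, which a namespace-aware parser rejects, so the sketch binds them to placeholder URIs. Those URIs, the helper names `parse_record` and `summarize`, and the trimmed `SAMPLE` are assumptions for illustration, not part of Dilib or the Wicri schema.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (hypothetical): they only satisfy the parser,
# since the displayed record does not declare the wicri: and inist: prefixes.
PLACEHOLDER_NS = 'xmlns:wicri="urn:example:wicri" xmlns:inist="urn:example:inist"'


def parse_record(xml_text):
    """Parse a <record> string after binding the undeclared prefixes."""
    patched = xml_text.replace("<record>", f"<record {PLACEHOLDER_NS}>", 1)
    return ET.fromstring(patched)


def summarize(record):
    """Return (title, [author names]) from the record's titleStmt."""
    stmt = record.find("./TEI/teiHeader/fileDesc/titleStmt")
    title = stmt.findtext("title")
    authors = [author.findtext("name") for author in stmt.findall("author")]
    return title, authors


# Shortened copy of the record above, kept only as far as the sketch needs.
SAMPLE = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Efficient supervised and semi-supervised approaches for affiliations disambiguation</title>
<author>
<name sortKey="Cuxac, Pascal">Pascal Cuxac</name>
<affiliation wicri:level="1">
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Lamirel, Jean Charles">Jean-Charles Lamirel</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>LORIA-Synalp, Vandoeuvre les Nancy</s1>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
</titleStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

title, authors = summarize(parse_record(SAMPLE))
print(title)
print(authors)
```

Unprefixed elements such as `title`, `author`, and `name` stay in no namespace after the patch, so plain paths are enough to reach them; only prefixed elements like `inist:fA14` would need the placeholder URI in the query.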

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000198 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000198 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:13-0331130
   |texte=   Efficient supervised and semi-supervised approaches for affiliations disambiguation
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024